Results 1 - 20 of 4,341
1.
PLoS One ; 19(5): e0301262, 2024.
Article in English | MEDLINE | ID: mdl-38722864

ABSTRACT

Frequent sequence pattern mining is an excellent tool for discovering patterns in event chains. In complex systems, events from parallel processes are present, often without proper labelling. To identify the groups of events related to a subprocess, frequent sequential pattern mining can be applied. Since most algorithms produce too many frequent sequences, making the results difficult to interpret, the resulting frequent patterns need to be post-processed. The available visualisation techniques do not allow easy access to the multiple properties that support a faster and better understanding of event scenarios. To address this issue, our work proposes an intuitive and interactive solution to support this task, introducing three novel network-based sequence visualisation methods that can reduce the time of information processing from a cognitive perspective. The proposed visualisation methods offer a more information-rich and easily understandable interpretation of sequential pattern mining results than the usual text-like output of pattern mining algorithms. The first uses the confidence values of the transitions to create a weighted network, while the second enriches the confidence-based adjacency matrix with similarities of the transitive nodes. The enriched matrix enables a similarity-based Multidimensional Scaling (MDS) projection of the sequences. The third method measures similarity based on the overlap of the occurrences of the sequences' supporting events. The applicability of the method is presented in an industrial alarm management problem and in the analysis of clickstreams of a website. The method was fully implemented in a Python environment. The results show that the proposed methods are highly applicable for the interactive processing of frequent sequences, supporting the exploration of the inner mechanisms of complex systems.
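As a minimal sketch of the first visualisation method, the confidence of each event transition can be estimated from the mined sequences and used as an edge weight in the network. The toy event chains below are hypothetical, not the paper's data:

```python
from collections import Counter

def transition_confidences(sequences):
    """Estimate confidence for each transition (a -> b) as
    count(a directly followed by b) / count(a), over a set of sequences."""
    pair_counts = Counter()
    event_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
        for e in seq[:-1]:  # events that can start a transition
            event_counts[e] += 1
    return {(a, b): c / event_counts[a] for (a, b), c in pair_counts.items()}

# Toy event chains, e.g. alarms from two parallel subprocesses
seqs = [["A", "B", "C"], ["A", "B", "D"], ["A", "C"]]
conf = transition_confidences(seqs)
```

The resulting dictionary maps each directed edge to its confidence, ready to be loaded as a weighted adjacency structure in any graph library.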


Subject(s)
Algorithms , Data Mining/methods , Humans
2.
BMC Plant Biol ; 24(1): 373, 2024 May 08.
Article in English | MEDLINE | ID: mdl-38714965

ABSTRACT

BACKGROUND: As one of the world's most important beverage crops, tea plants (Camellia sinensis) are renowned for their unique flavors and numerous beneficial secondary metabolites, attracting researchers to investigate the formation of tea quality. With the increasing availability of transcriptome data on tea plants in public databases, conducting large-scale co-expression analyses has become feasible to meet the demand for functional characterization of tea plant genes. However, as the multidimensional noise increases, larger-scale co-expression analyses are not always effective. Analyzing a subset of samples generated by effectively downsampling and reorganizing the global sample set often leads to more accurate results in co-expression analysis. Meanwhile, global-based co-expression analyses are more likely to overlook condition-specific gene interactions, which may be more important and worthy of exploration and research. RESULTS: Here, we employed the k-means clustering method to organize and classify the global samples of tea plants, resulting in clustered samples. Metadata annotations were then performed on these clustered samples to determine the "conditions" represented by each cluster. Subsequently, we conducted gene co-expression network analysis (WGCNA) separately on the global samples and the clustered samples, resulting in global modules and cluster-specific modules. Comparative analyses of global modules and cluster-specific modules have demonstrated that cluster-specific modules exhibit higher accuracy in co-expression analysis. To measure the degree of condition specificity of genes within condition-specific clusters, we introduced the correlation difference value (CDV). By incorporating the CDV into co-expression analyses, we can assess the condition specificity of genes. 
This approach proved instrumental in identifying a series of high-CDV transcription factor-encoding genes upregulated during sustained cold treatment in Camellia sinensis leaves and buds, and in pinpointing a pair of genes that participate in the antioxidant defense system of tea plants under sustained cold stress. CONCLUSIONS: To summarize, downsampling and reorganizing the sample set improved the accuracy of co-expression analysis. Cluster-specific modules were more accurate in capturing condition-specific gene interactions. The introduction of the CDV allowed for the assessment of condition specificity in gene co-expression analyses. Using this approach, we identified a series of high-CDV transcription factor-encoding genes related to sustained cold stress in Camellia sinensis. This study highlights the importance of considering condition specificity in co-expression analysis and provides insights into the regulation of the cold stress response in Camellia sinensis.
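The abstract does not give the exact formula for the CDV, so the sketch below assumes one plausible definition: the gene-pair Pearson correlation within a sample cluster minus the correlation over the global sample set. The expression matrix is a made-up toy example in which two genes co-vary only inside the first cluster:

```python
import numpy as np

def cdv(expr, cluster_idx, g1, g2):
    """Illustrative correlation difference value: the gene-pair Pearson
    correlation within a sample cluster minus the correlation over the
    global sample set (assumed definition, not the paper's formula)."""
    r_global = np.corrcoef(expr[g1], expr[g2])[0, 1]
    sub = expr[:, cluster_idx]
    r_cluster = np.corrcoef(sub[g1], sub[g2])[0, 1]
    return r_cluster - r_global

# Toy matrix: rows are genes, columns are samples; the two genes agree in
# samples 0-3 (one cluster) and disagree in samples 4-7.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0, -1.0, -2.0, -3.0, -4.0],
])
delta = cdv(expr, [0, 1, 2, 3], 0, 1)
```

A large positive value flags a gene pair whose co-expression is specific to the cluster's condition rather than global.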


Subject(s)
Camellia sinensis , Camellia sinensis/genetics , Camellia sinensis/metabolism , Cluster Analysis , Genes, Plant , Gene Expression Profiling/methods , Data Mining/methods , Transcriptome , Gene Expression Regulation, Plant , Gene Regulatory Networks
3.
PLoS One ; 19(5): e0302595, 2024.
Article in English | MEDLINE | ID: mdl-38718024

ABSTRACT

Diabetes Mellitus is one of the oldest diseases known to humankind, dating back to ancient Egypt. The disease is a chronic metabolic disorder that heavily burdens healthcare providers worldwide due to the steady year-on-year increase in patients. Worryingly, diabetes affects not only the aging population but also children. Controlling this problem is imperative, as diabetes can lead to many health complications. As technology evolves, computer technology is increasingly integrated with healthcare systems. The utilization of artificial intelligence helps healthcare become more efficient in diagnosing diabetes patients, deliver better care, and be more patient-centric. Among the advanced data mining techniques in artificial intelligence, stacking is one of the most prominent methods applied in the diabetes domain. Hence, this study investigates the potential of stacking ensembles. The aim of this study is to reduce the high complexity inherent in stacking, which contributes to longer training times, and to remove outliers in the diabetes data to improve classification performance. To address these concerns, a novel machine learning method called Stacking Recursive Feature Elimination-Isolation Forest was introduced for diabetes prediction. Stacking is combined with Recursive Feature Elimination to design an efficient model for diabetes diagnosis that uses fewer features as resources. The method also incorporates Isolation Forest for outlier removal. The study uses accuracy, precision, recall, F1 measure, training time, and standard deviation metrics to assess classification performance. The proposed method achieved an accuracy of 79.077% on the PIMA Indians Diabetes dataset and 97.446% on the Diabetes Prediction dataset, outperforming many existing methods and demonstrating effectiveness in the diabetes domain.
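A hedged sketch of the overall recipe (Isolation Forest outlier removal, then Recursive Feature Elimination feeding a stacking ensemble) might look as follows in scikit-learn. The base learners, meta-learner, and synthetic data are illustrative choices, not the paper's exact configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (IsolationForest, RandomForestClassifier,
                              StackingClassifier)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a diabetes dataset
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)

# 1. Outlier removal: keep only samples Isolation Forest marks as inliers (+1)
mask = IsolationForest(random_state=0).fit_predict(X) == 1
X, y = X[mask], y[mask]

# 2. RFE shrinks the feature set, then a stacking ensemble classifies
model = make_pipeline(
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=5),
    StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("dt", DecisionTreeClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
acc = model.fit(X_tr, y_tr).score(X_te, y_te)
```

Placing RFE before the stacking step is what limits the feature count the ensemble has to process, which is the complexity-reduction idea the abstract describes.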


Subject(s)
Diabetes Mellitus , Machine Learning , Humans , Diabetes Mellitus/diagnosis , Algorithms , Data Mining/methods , Support Vector Machine , Male
4.
J Med Internet Res ; 26: e48572, 2024 May 03.
Article in English | MEDLINE | ID: mdl-38700923

ABSTRACT

BACKGROUND: Adverse drug reactions (ADRs), which are the phenotypic manifestations of clinical drug toxicity in humans, are a major concern in precision clinical medicine. A comprehensive evaluation of ADRs is helpful for unbiased supervision of marketed drugs and for discovering new drugs with high success rates. OBJECTIVE: In current practice, drug safety evaluation is often oversimplified to the occurrence or nonoccurrence of ADRs. Given the limitations of current qualitative methods, there is an urgent need for a quantitative evaluation model to improve pharmacovigilance and the accurate assessment of drug safety. METHODS: In this study, we developed a mathematical model, namely the Adverse Drug Reaction Classification System (ADReCS) severity-grading model, for the quantitative characterization of ADR severity, a crucial feature for evaluating the impact of ADRs on human health. The model was constructed by mining millions of real-world historical adverse drug event reports. A new parameter called Severity_score was introduced to measure the severity of ADRs, and upper and lower score boundaries were determined for 5 severity grades. RESULTS: The ADReCS severity-grading model exhibited excellent consistency (99.22%) with the expert-grading system, the Common Terminology Criteria for Adverse Events. Hence, we graded the severity of 6277 standard ADRs for 129,407 drug-ADR pairs. Moreover, we calculated the occurrence rates of 6272 distinct ADRs for 127,763 drug-ADR pairs in large patient populations by mining real-world medication prescriptions. With the quantitative features, we demonstrated example applications in systematically elucidating ADR mechanisms and thereby discovered a list of drugs with improper dosages. CONCLUSIONS: In summary, this study represents the first comprehensive determination of both ADR severity grades and ADR frequencies. 
This endeavor establishes a strong foundation for future artificial intelligence applications in discovering new drugs with high efficacy and low toxicity. It also heralds a paradigm shift in clinical toxicity research, moving from qualitative description to quantitative evaluation.
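The abstract does not disclose how Severity_score is computed, so the following is a purely illustrative scheme: an outcome-weighted mean over the adverse event reports mentioning an ADR, mapped to 5 grades by fixed score boundaries. The weights and boundaries are hypothetical, not the ADReCS model's parameters:

```python
# Hypothetical outcome weights (NOT the paper's actual parameters)
OUTCOME_WEIGHT = {"death": 5, "life-threatening": 4, "hospitalization": 3,
                  "intervention": 2, "non-serious": 1}

def severity_score(outcome_counts):
    """Illustrative Severity_score: outcome-weighted mean over all
    reports mentioning a given ADR."""
    total = sum(outcome_counts.values())
    return sum(OUTCOME_WEIGHT[o] * n for o, n in outcome_counts.items()) / total

def grade(score, boundaries=(1.5, 2.5, 3.5, 4.5)):
    """Map a score to one of 5 severity grades via fixed upper/lower
    boundaries (boundary values are invented for illustration)."""
    return 1 + sum(score >= b for b in boundaries)
```

The real model mined millions of reports to place those boundaries; here they are simply fixed constants to show the grading mechanics.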


Subject(s)
Big Data , Data Mining , Drug-Related Side Effects and Adverse Reactions , Humans , Data Mining/methods , Pharmacovigilance , Models, Theoretical , Adverse Drug Reaction Reporting Systems/statistics & numerical data
5.
PLoS One ; 19(5): e0301608, 2024.
Article in English | MEDLINE | ID: mdl-38691555

ABSTRACT

The application of pattern mining algorithms to extract movement patterns from sports big data can improve training specificity by facilitating a more granular evaluation of movement. Since movement patterns can occur only as consecutive, non-consecutive, or non-sequential, this study aimed to identify the best set of movement patterns for player movement profiling in professional rugby league and to quantify the similarity among distinct movement patterns. Three pattern mining algorithms (l-length Closed Contiguous [LCCspm], Longest Common Subsequence [LCS] and AprioriClose) were used to extract patterns to profile elite rugby football league hookers (n = 22 players) and wingers (n = 28 players) match-game movements across 319 matches. The Jaccard similarity score was used to quantify the similarity between the algorithms' movement patterns, and machine learning classification modelling identified which algorithm's movement patterns best separated playing positions. LCCspm and LCS movement patterns shared a Jaccard similarity score of 0.19. AprioriClose movement patterns shared negligible Jaccard similarity with LCCspm (0.008) and LCS (0.009) patterns. The closed contiguous movement patterns profiled by LCCspm best separated players into playing positions. A Multi-layered Perceptron classification algorithm achieved the highest accuracy of 91.02%, with precision, recall and F1 scores of 0.91. Therefore, we recommend the extraction of closed contiguous (consecutive) movement patterns over non-consecutive and non-sequential ones for separating groups of players.
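The Jaccard score used to compare the algorithms' outputs is straightforward to reproduce over sets of mined patterns; the movement patterns below are invented examples, not the study's data:

```python
def jaccard(patterns_a, patterns_b):
    """Jaccard similarity between two sets of mined movement patterns:
    |intersection| / |union|."""
    a, b = set(patterns_a), set(patterns_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Invented movement patterns standing in for two algorithms' outputs
lccspm = {("jog", "sprint"), ("walk", "jog"), ("sprint", "walk")}
lcs = {("jog", "sprint"), ("walk", "jog"), ("jog", "walk"), ("stand", "walk")}
similarity = jaccard(lccspm, lcs)
```

A score near 0, as between AprioriClose and the other two algorithms in the study, indicates the algorithms surface largely disjoint pattern sets.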


Subject(s)
Algorithms , Football , Movement , Humans , Football/physiology , Movement/physiology , Athletic Performance/physiology , Male , Machine Learning , Athletes , Data Mining/methods , Adult , Rugby
6.
Health Informatics J ; 30(2): 14604582241240680, 2024.
Article in English | MEDLINE | ID: mdl-38739488

ABSTRACT

Objective: This study examined major themes and sentiments, and their trajectories and interactions over time, using subcategories of Reddit data. The aim was to facilitate decision-making for psychosocial rehabilitation. Materials and Methods: We utilized natural language processing techniques, including topic modeling and sentiment analysis, on a dataset consisting of more than 38,000 topics, comments, and posts collected from a subreddit dedicated to the experiences of people who tested positive for COVID-19. In this longitudinal exploratory analysis, we studied the dynamics between the most dominant topics and subjects' emotional states over an 18-month period. Results: Our findings highlight the evolution of the textual and sentimental status of major topics discussed by COVID survivors over an extended period of the pandemic. We particularly studied the pre- and post-vaccination eras as a turning point in the timeline of the pandemic. The results show that not only does the relevance of topics change over time, but the emotions attached to them also vary. Major social events, such as the administration of vaccines or the enforcement of nationwide policies, are also reflected in the discussions and inquiries of social media users, and in particular in the emotional state (i.e., the sentiments and polarity of the feelings) of those who have experienced COVID personally. Discussion: Cumulative societal knowledge regarding the COVID-19 pandemic impacts the patterns with which people discuss their experiences, concerns, and opinions. The subjects' emotional state with respect to different topics was also affected by extraneous factors and events, such as vaccination.
Conclusion: By mining major topics, sentiments, and trajectories demonstrated in COVID-19 survivors' interactions on Reddit, this study contributes to the emerging body of scholarship on COVID-19 survivors' mental health outcomes, providing insights into the design of mental health support and rehabilitation services for COVID-19 survivors.


Subject(s)
COVID-19 , SARS-CoV-2 , Survivors , Humans , COVID-19/psychology , COVID-19/epidemiology , Survivors/psychology , Data Mining/methods , Pandemics , Natural Language Processing , Social Media/trends , Longitudinal Studies
7.
Sensors (Basel) ; 24(9)2024 Apr 30.
Article in English | MEDLINE | ID: mdl-38732962

ABSTRACT

Being motivated has a positive influence on task performance. However, motivation can result from various motives that affect different parts of the brain. Analyzing the motivation effect from all affected areas requires a high number of EEG electrodes, resulting in high cost, inflexibility, and burden to users. In many real-world applications, only the motivation effect is required for performance evaluation, regardless of the motive. Analyzing the relationships between the motivation-affected brain areas associated with task performance could limit the required electrodes. This study introduces a method to identify the cognitive motivation effect with a reduced number of EEG electrodes. The temporal association rule mining (TARM) concept was used to analyze the relationships between attention and memorization brain areas under the effect of motivation in a cognitive motivation task. To improve accuracy, the artificial bee colony (ABC) algorithm was applied with the central limit theorem (CLT) concept to optimize the TARM parameters. The results show that our method can identify the motivation effect with only the FCz and P3 electrodes, with an average classification accuracy of 74.5% in individual tests.
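As a simplified illustration of the TARM idea (without the ABC parameter optimization), the confidence of a temporal rule "antecedent followed by consequent within a time window" can be computed over a timestamped event stream. The event labels, timestamps, and window are invented:

```python
def temporal_rule_confidence(events, antecedent, consequent, window):
    """Confidence of the temporal rule 'antecedent -> consequent within
    `window` time units', over a list of (timestamp, label) events."""
    events = sorted(events)
    n_ante = n_both = 0
    for i, (t, label) in enumerate(events):
        if label != antecedent:
            continue
        n_ante += 1
        # Does the consequent occur strictly after t, within the window?
        if any(lbl == consequent and 0 < t2 - t <= window
               for t2, lbl in events[i + 1:]):
            n_both += 1
    return n_both / n_ante if n_ante else 0.0

# Invented stream of area-activation events
events = [(0, "attention"), (1, "memorization"), (5, "attention"),
          (9, "memorization"), (10, "attention")]
c = temporal_rule_confidence(events, "attention", "memorization", window=2)
```

In the study's setting, the window length would be among the TARM parameters the ABC algorithm tunes.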


Subject(s)
Algorithms , Cognition , Electroencephalography , Motivation , Motivation/physiology , Electroencephalography/methods , Humans , Cognition/physiology , Male , Adult , Female , Brain/physiology , Young Adult , Electrodes , Data Mining/methods
8.
Support Care Cancer ; 32(5): 314, 2024 Apr 29.
Article in English | MEDLINE | ID: mdl-38683417

ABSTRACT

PURPOSE: This study aimed to assess the different needs of patients with breast cancer and their families in online health communities at different treatment phases using a Latent Dirichlet Allocation (LDA) model. METHODS: Using Python, breast cancer-related posts were collected from two online health communities: patient-to-patient and patient-to-doctor. After data cleaning, eligible posts were categorized based on the treatment phase. Subsequently, an LDA model identifying the distinct need-related topics for each phase of treatment, including data preprocessing and LDA topic modeling, was established. Additionally, the demographic and interactive features of the posts were manually analyzed. RESULTS: We collected 84,043 posts, of which 9504 posts were included after data cleaning. Early diagnosis and rehabilitation treatment phases had the highest and lowest number of posts, respectively. LDA identified 11 topics: three in the initial diagnosis phase and two in each of the remaining treatment phases. The topics included disease outcomes, diagnosis analysis, treatment information, and emotional support in the initial diagnosis phase; surgical options and outcomes, postoperative care, and treatment planning in the perioperative treatment phase; treatment options and costs, side effects management, and disease prognosis assessment in the non-operative treatment phase; diagnosis and treatment options, disease prognosis, and emotional support in the relapse and metastasis treatment phase; and follow-up and recurrence concerns, physical symptoms, and lifestyle adjustments in the rehabilitation treatment phase. CONCLUSION: The needs of patients with breast cancer and their families differ across various phases of cancer therapy. Therefore, specific information or emotional assistance should be tailored to each phase of treatment based on the unique needs of patients and their families.


Subject(s)
Breast Neoplasms , Data Mining , Humans , Breast Neoplasms/psychology , Breast Neoplasms/therapy , Breast Neoplasms/rehabilitation , Female , Data Mining/methods , Needs Assessment , Internet
9.
Zhongguo Zhong Yao Za Zhi ; 49(3): 836-841, 2024 Feb.
Article in Chinese | MEDLINE | ID: mdl-38621887

ABSTRACT

This study aims to construct the element relationships and extension paths of a clinical evidence knowledge map for Chinese patent medicine, providing basic technical support for the formation and transformation of the evidence chain of Chinese patent medicine, as well as schemes for collecting, organizing, and summarizing massive and disorganized clinical data. Based on the evidence-based PICOS elements, conventional knowledge graph construction methods were collected and summarized. First, the data entities related to Chinese patent medicine were classified and entity linking (disambiguation) was performed. Second, the attribute information of the data entities was associated and classified. Finally, the logical relationships between entities were constructed, and the element relationships and extension paths of a knowledge map conforming to the characteristics of clinical evidence of Chinese patent medicine were summarized. The construction of the clinical evidence knowledge map of Chinese patent medicine was based mainly on process design and logical structure, and the element relationships of the knowledge map were expressed according to the PICOS principle and evidence level. The extension path crossed three levels (model layer, data layer application, and new evidence application), and the study gradually explored the path from disease, core evaluation indicators, Chinese patent medicine, core prescriptions, syndrome and treatment rules, and medical case comparison (evolution laws) to new drug research and development. This study clarifies the top-level design for constructing the clinical evidence knowledge map of Chinese patent medicine, but realizing it will require joint interdisciplinary efforts. With continuous improvement of map construction technology suited to the characteristics of TCM, the study can provide necessary basic technical support and references for the development of the TCM discipline.
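One common way to represent such element relationships is as subject-relation-object triples with one-hop traversal; the entities and relations below are hypothetical examples, not content from the study:

```python
# Illustrative triples for a PICOS-structured clinical-evidence graph
# (entity and relation names are invented examples)
triples = [
    ("Compound Danshen Dripping Pill", "treats", "stable angina"),
    ("stable angina", "evaluated_by", "angina attack frequency"),
    ("Compound Danshen Dripping Pill", "compared_with", "placebo"),
]

def neighbors(triples, entity):
    """All (relation, object) pairs reachable from `entity` in one hop,
    i.e. the entity's outgoing edges in the knowledge map."""
    return [(r, o) for s, r, o in triples if s == entity]

edges = neighbors(triples, "Compound Danshen Dripping Pill")
```

Extension paths of the kind the study describes (disease to indicators to prescriptions and onward) would then be multi-hop walks over such triples.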


Subject(s)
Drugs, Chinese Herbal , Drugs, Chinese Herbal/therapeutic use , Medicine, Chinese Traditional , Nonprescription Drugs/therapeutic use , Technology , Data Mining/methods
10.
BMC Med Inform Decis Mak ; 24(Suppl 3): 98, 2024 Apr 17.
Article in English | MEDLINE | ID: mdl-38632621

ABSTRACT

BACKGROUND: Tremendous research efforts have been made in the Alzheimer's disease (AD) field to understand the disease etiology, progression and discover treatments for AD. Many mechanistic hypotheses, therapeutic targets and treatment strategies have been proposed in the last few decades. Reviewing previous work and staying current on this ever-growing body of AD publications is an essential yet difficult task for AD researchers. METHODS: In this study, we designed and implemented a natural language processing (NLP) pipeline to extract gene-specific neurodegenerative disease (ND) -focused information from the PubMed database. The collected publication information was filtered and cleaned to construct AD-related gene-specific publication profiles. Six categories of AD-related information are extracted from the processed publication data: publication trend by year, dementia type occurrence, brain region occurrence, mouse model information, keywords occurrence, and co-occurring genes. A user-friendly web portal is then developed using Django framework to provide gene query functions and data visualizations for the generalized and summarized publication information. RESULTS: By implementing the NLP pipeline, we extracted gene-specific ND-related publication information from the abstracts of the publications in the PubMed database. The results are summarized and visualized through an interactive web query portal. Multiple visualization windows display the ND publication trends, mouse models used, dementia types, involved brain regions, keywords to major AD-related biological processes, and co-occurring genes. Direct links to PubMed sites are provided for all recorded publications on the query result page of the web portal. 
CONCLUSION: The resulting portal is a valuable tool and data source for quickly querying and displaying AD publications tailored to users' research areas and gene targets of interest, which is especially convenient for users without informatics mining skills. Our study will not only keep AD researchers updated on the progress of AD research and help them conduct preliminary examinations efficiently, but also offer additional support for hypothesis generation and validation, contributing significantly to the communication, dissemination, and progress of AD research.
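One of the six extracted categories, the publication trend by year for a queried gene, reduces to simple counting over (year, abstract) records once the pipeline has filtered them; the records below are invented:

```python
from collections import defaultdict

def publication_trend(records, gene):
    """Count publications per year whose abstract mentions `gene`.
    Each record is a (year, abstract_text) pair."""
    trend = defaultdict(int)
    for year, abstract in records:
        if gene.lower() in abstract.lower():
            trend[year] += 1
    return dict(trend)

# Invented abstract records standing in for filtered PubMed entries
records = [
    (2021, "APOE variants modulate amyloid clearance in AD"),
    (2022, "Tau and APOE interactions in a mouse model"),
    (2022, "Microglial TREM2 signalling in neurodegeneration"),
]
trend = publication_trend(records, "APOE")
```

A real pipeline would add gene-name normalization (synonyms, case, symbols vs. full names) before counting; plain substring matching is only for illustration.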


Subject(s)
Alzheimer Disease , Neurodegenerative Diseases , Animals , Mice , Data Mining/methods , PubMed , Databases, Factual
11.
PLoS Comput Biol ; 20(4): e1011989, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38626249

ABSTRACT

Biomedical texts provide important data for investigating drug-drug interactions (DDIs) in the field of pharmacovigilance. Although researchers have attempted to investigate DDIs from biomedical texts and predict unknown DDIs, the lack of accurate manual annotations significantly hinders the performance of machine learning algorithms. In this study, a new DDI prediction framework, the Subgraph Enhance model for DDI (SubGE-DDI), was developed to improve the performance of machine learning algorithms. This model uses drug-pair knowledge subgraph information to achieve large-scale plain-text prediction without many annotations. It treats DDI prediction as a multi-class classification problem and predicts the specific DDI type for each drug pair (e.g., Mechanism, Effect, Advise, Interact and Negative). The drug-pair knowledge subgraph was derived from a large drug knowledge graph containing various public datasets, such as DrugBank, TwoSIDES, OffSIDES, DrugCentral, Entrez Gene, SMPDB (the Small Molecule Pathway Database), CTD (the Comparative Toxicogenomics Database) and SIDER. SubGE-DDI was evaluated on a public dataset (the SemEval-2013 Task 9 dataset) and compared with other state-of-the-art baselines. SubGE-DDI achieves an 83.91% micro F1 score and an 84.75% macro F1 score on the test dataset, outperforming the other state-of-the-art baselines. These findings show that the proposed drug-pair knowledge subgraph-assisted model can effectively improve the prediction performance of DDIs from biomedical texts.


Subject(s)
Algorithms , Computational Biology , Drug Interactions , Machine Learning , Computational Biology/methods , Humans , Pharmacovigilance , Databases, Factual , Data Mining/methods
12.
J Med Syst ; 48(1): 47, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38662184

ABSTRACT

Ontologies serve as comprehensive frameworks for organizing domain-specific knowledge, offering significant benefits for managing clinical data. This study presents the development of the Fall Risk Management Ontology (FRMO), designed to enhance clinical text mining, facilitate integration and interoperability between disparate data sources, and streamline clinical data analysis. By representing major entities within the fall risk management domain, the FRMO supports the unification of clinical language and decision-making processes, ultimately contributing to the prevention of falls among older adults. We used Ontology Web Language (OWL) to build the FRMO in Protégé. Of the seven steps of the Stanford approach, six steps were utilized in the development of the FRMO: (1) defining the domain and scope of the ontology, (2) reusing existing ontologies when possible, (3) enumerating ontology terms, (4) specifying the classes and their hierarchy, (5) defining the properties of the classes, and (6) defining the facets of the properties. We evaluated the FRMO using four main criteria: consistency, completeness, accuracy, and clarity. The developed ontology comprises 890 classes arranged in a hierarchical structure, including six top-level classes with a total of 43 object properties and 28 data properties. FRMO is the first comprehensively described semantic ontology for fall risk management. Healthcare providers can use the ontology as the basis of clinical decision technology for managing falls among older adults.
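A class hierarchy with "is-a" reasoning, the backbone of such an ontology, can be sketched minimally in plain Python; the class names below are hypothetical, not the published FRMO classes:

```python
# Minimal sketch of an ontology class hierarchy as parent links
# (hypothetical fall-risk class names, not the actual FRMO)
PARENT = {
    "IntrinsicRiskFactor": "RiskFactor",
    "ExtrinsicRiskFactor": "RiskFactor",
    "GaitImpairment": "IntrinsicRiskFactor",
    "RiskFactor": None,  # top-level class
}

def is_a(cls, ancestor):
    """Walk parent links to decide whether `cls` is subsumed by `ancestor`."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False
```

A real OWL ontology in Protégé additionally carries object and data properties and supports reasoner-checked consistency, which this dictionary sketch cannot express.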


Subject(s)
Accidental Falls , Data Mining , Risk Management , Accidental Falls/prevention & control , Humans , Data Mining/methods , Biological Ontologies , Electronic Health Records/organization & administration , Semantics
13.
Artif Intell Med ; 151: 102847, 2024 May.
Article in English | MEDLINE | ID: mdl-38658131

ABSTRACT

Building clinical registries is an important step in clinical research and in improving the quality of patient care. Natural Language Processing (NLP) methods have shown promising results in extracting valuable information from unstructured clinical notes. However, the structure and nature of clinical notes are very different from the regular text that state-of-the-art NLP models are trained and tested on, and they pose their own set of challenges. In this study, we propose Sentence Extractor with Keywords (SE-K), an efficient and interpretable classification approach for extracting information from clinical notes, and show that it outperforms more computationally expensive methods in text classification. Following Institutional Review Board (IRB) approval, we used SE-K and two embedding-based NLP approaches (Sentence Extractor with Embeddings (SE-E) and Bidirectional Encoder Representations from Transformers (BERT)) to develop a comprehensive registry of anterior cruciate ligament surgeries from 20 years of unstructured clinical data at a multi-site tertiary-care regional children's hospital. The low-resource approach (SE-K) had better performance (average AUROC of 0.94 ± 0.04) than the embedding-based approaches (SE-E: 0.93 ± 0.04 and BERT: 0.87 ± 0.09) on out-of-sample validation, in addition to the smallest performance drop between test and out-of-sample validation. Moreover, the SE-K approach was at least six times faster (on CPU) than SE-E (on CPU) and BERT (on GPU) and provides interpretability. Our proposed approach, SE-K, can be effectively used to extract relevant variables from clinical notes to build large-scale registries, with consistently better performance than the more resource-intensive approaches (e.g., BERT). Such approaches can facilitate information extraction from unstructured notes for registry building, quality improvement and adverse event monitoring.
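The abstract describes SE-K only at a high level, so the sketch below shows just a plausible first stage: keeping the sentences of a note that mention task keywords, before any classification step. The note text and keywords are invented:

```python
import re

def extract_sentences(note, keywords):
    """Keep only the sentences of a clinical note that mention at least
    one task keyword (illustrative first stage of a keyword-based
    sentence extractor; case-insensitive substring match)."""
    sentences = re.split(r"(?<=[.!?])\s+", note.strip())
    kw = [k.lower() for k in keywords]
    return [s for s in sentences if any(k in s.lower() for k in kw)]

# Invented note text, not real patient data
note = ("Patient presents for follow-up. MRI confirms complete ACL tear. "
        "No allergies reported. ACL reconstruction scheduled for next month.")
hits = extract_sentences(note, ["ACL", "reconstruction"])
```

Restricting downstream classification to these sentences is what keeps such an approach cheap and interpretable relative to embedding the full note.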


Subject(s)
Natural Language Processing , Registries , Humans , Electronic Health Records , Data Mining/methods
14.
Food Chem Toxicol ; 187: 114638, 2024 May.
Article in English | MEDLINE | ID: mdl-38582341

ABSTRACT

With a society increasingly demanding alternative protein food sources, new strategies are needed for evaluating protein safety issues, such as allergenic potential. Large-scale and systematic studies of allergenic proteins are hindered by the limited and non-harmonized clinical information available for these substances in dedicated databases. A key piece of missing information is the symptomatology of the allergens, especially when expressed in standard vocabularies, which would allow connecting with other biomedical resources to carry out different studies related to human health. In this work, we have generated the first resource with a comprehensive annotation of allergens' symptomatology, using a text-mining approach that extracts significant co-mentions between these entities from the scientific literature (PubMed, ∼36 million abstracts). The method identifies statistically significant co-mentions between the textual descriptions of the two types of entities in the literature as an indication of a relationship. 1,180 clinical signs extracted from the Human Phenotype Ontology and the Medical Subject Headings terms of PubMed, together with other allergen-specific symptoms, were linked to 1,036 unique allergens annotated in two main allergen-related public databases via 14,009 relationships. This novel resource, publicly available through an interactive web interface, could serve as a starting point for future manually curated compilations of allergen symptomatology.
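Co-mention significance of this kind is often scored with a one-sided hypergeometric test; the sketch below assumes that formulation, since the abstract does not give the exact statistic used:

```python
from math import comb

def comention_pvalue(n_total, n_allergen, n_symptom, n_both):
    """One-sided hypergeometric tail P(X >= n_both): probability of seeing
    at least n_both abstracts mentioning both entities by chance, given
    n_allergen and n_symptom abstracts mention each entity individually."""
    p = 0.0
    for k in range(n_both, min(n_allergen, n_symptom) + 1):
        p += (comb(n_symptom, k) * comb(n_total - n_symptom, n_allergen - k)
              / comb(n_total, n_allergen))
    return p

# E.g. 5 joint mentions out of 100 abstracts, where each entity appears in 10
p = comention_pvalue(100, 10, 10, 5)
```

A small p-value flags an allergen-symptom pair that co-occurs in the literature far more often than independence would predict.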


Subject(s)
Allergens , Data Mining , Humans , Data Mining/methods , Databases, Factual , Proteins/metabolism
15.
Sci Rep ; 14(1): 7635, 2024 04 01.
Article in English | MEDLINE | ID: mdl-38561391

ABSTRACT

Extracting knowledge from hybrid data, comprising both categorical and numerical data, poses significant challenges due to the inherent difficulty in preserving information and practical meanings during the conversion process. To address this challenge, hybrid data processing methods, combining complementary rough sets, have emerged as a promising approach for handling uncertainty. However, selecting an appropriate model and effectively utilizing it in data mining requires a thorough qualitative and quantitative comparison of existing hybrid data processing models. This research aims to contribute to the analysis of hybrid data processing models based on neighborhood rough sets by investigating the inherent relationships among these models. We propose a generic neighborhood rough set-based hybrid model specifically designed for processing hybrid data, thereby enhancing the efficacy of the data mining process without resorting to discretization and avoiding information loss or practical meaning degradation in datasets. The proposed scheme dynamically adapts the threshold value for the neighborhood approximation space according to the characteristics of the given datasets, ensuring optimal performance without sacrificing accuracy. To evaluate the effectiveness of the proposed scheme, we develop a testbed tailored for Parkinson's patients, a domain where hybrid data processing is particularly relevant. The experimental results demonstrate that the proposed scheme consistently outperforms existing schemes in adaptively handling both numerical and categorical data, achieving an impressive accuracy of 95% on the Parkinson's dataset. Overall, this research contributes to advancing hybrid data processing techniques by providing a robust and adaptive solution that addresses the challenges associated with handling hybrid data, particularly in the context of Parkinson's disease analysis.
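A core neighborhood rough set operation, the lower approximation of a decision class under a distance threshold, can be sketched as follows; the fixed `delta` here stands in for the adaptive, dataset-dependent threshold the scheme proposes:

```python
import numpy as np

def lower_approximation(X, labels, target, delta):
    """Neighborhood rough set lower approximation of decision class
    `target`: indices of samples whose delta-neighborhood (Euclidean)
    lies entirely inside that class."""
    idx = []
    for i, x in enumerate(X):
        dist = np.linalg.norm(X - x, axis=1)
        neigh = labels[dist <= delta]  # includes the sample itself
        if np.all(neigh == target):
            idx.append(i)
    return idx

# Toy 1-D numerical data with two decision classes
X = np.array([[0.0], [0.1], [0.5], [1.0], [1.1]])
labels = np.array([0, 0, 1, 1, 1])
certain_class1 = lower_approximation(X, labels, 1, delta=0.2)
```

Handling hybrid data would additionally need a mixed distance (e.g. overlap for categorical attributes plus normalized Euclidean for numerical ones), which this numerical-only sketch omits.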


Subject(s)
Algorithms , Parkinson Disease , Humans , Data Mining/methods , Uncertainty
16.
Stud Health Technol Inform ; 313: 74-80, 2024 Apr 26.
Article in English | MEDLINE | ID: mdl-38682508

ABSTRACT

While adherence to clinical guidelines improves the quality and consistency of care, personalized healthcare also requires a deep understanding of individual disease models and treatment plans. The structured preparation of medical routine data in a certain clinical context, e.g. a treatment pathway outlined in a medical guideline, is currently a challenging task. Medical data is often stored in diverse formats and systems, and the relevant clinical knowledge defining the context is not available in machine-readable formats. We present an approach to extract information from medical free-text documentation by using structured clinical knowledge to guide information extraction into a structured and encoded format, overcoming known challenges for natural language processing algorithms. Preliminary results have been encouraging: one of our methods extracted 100% of the targeted data points, with 85% accuracy on the detailed values. These advancements show the potential of our approach to effectively use unstructured clinical data to elevate the quality of patient care and reduce the workload of medical personnel.
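As a toy illustration of knowledge-guided extraction, a small dictionary of guideline concepts can pull encoded values out of free-text notes. The concept names, regex patterns, and example note below are hypothetical, not part of the presented system:

```python
# Knowledge-guided extraction sketch: each guideline concept carries a
# pattern; matches are written into a structured, encoded record.
import re

CONCEPTS = {
    "tumor_stage": r"\bstage\s+(I{1,3}V?|IV)\b",   # hypothetical pattern
    "ecog_status": r"\bECOG\s+([0-4])\b",          # hypothetical pattern
}

def extract(note):
    record = {}
    for concept, pattern in CONCEPTS.items():
        m = re.search(pattern, note, flags=re.IGNORECASE)
        if m:
            record[concept] = m.group(1)
    return record

note = "Patient with stage III disease, ECOG 1, started chemotherapy."
record = extract(note)
```

Real systems replace the regexes with NLP pipelines, but the guiding role of the structured clinical knowledge is the same.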


Subject(s)
Electronic Health Records , Natural Language Processing , Humans , Data Mining/methods , Information Storage and Retrieval/methods , Algorithms
17.
Clin Biochem ; 127-128: 110762, 2024 May.
Article in English | MEDLINE | ID: mdl-38582381

ABSTRACT

BACKGROUND: This study aims to investigate the impact of age and sex on high-sensitivity cardiac troponin T (hs-cTnT) and establish 99th percentile upper reference limits (URLs) in older individuals utilizing large-scale real-world data. METHODS: 40,530 outpatient hs-cTnT results were obtained from the laboratory database from January 1, 2018, to December 31, 2023. Our study included 4,199 elderly outpatients (aged ≥ 60) without cardiovascular disease or other heart-related chronic conditions. Nested analysis of variance was used to explore the necessity of partitioning reference intervals (RIs) by sex and age group. RIs were established with the refineR algorithm and considered valid if ≤ 10% of validation data set results fell outside the new RIs. RESULTS: RIs for hs-cTnT in the older population needed to be partitioned by sex and age group ([standard deviation ratio] SDRage = 0.75; SDRsex = 0.49). URLs in older Chinese adults were 21.8 ng/L for males, 16.5 ng/L for females, and 20.7 ng/L for the overall participant group. URLs for males aged 60-69, 70-79, and ≥ 80 were 13.7, 19.4, and 31.0 ng/L, respectively; the corresponding female values were 10.1, 17.2, and 22.0 ng/L. Importantly, the manufacturer-reported RIs do not suffice for Chinese individuals aged ≥ 70. In the validation data, 2.7-5.2% of test results fell outside the new RIs, confirming the validity of the results. CONCLUSION: This study establishes age- and sex-specific 99th percentile URLs for hs-cTnT in Chinese older individuals, thereby enhancing the accuracy of clinical assessments.
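The partitioning logic above can be illustrated with a simple standard deviation ratio (SDR) check plus a nonparametric 99th percentile. This is a minimal sketch with invented numbers; the 0.3 cutoff is a commonly used convention, and the study itself uses nested ANOVA and the refineR algorithm rather than this calculation:

```python
# SDR: between-group spread of group means relative to mean within-group
# spread; a large ratio suggests the groups need separate reference limits.
import statistics

def sdr(groups):
    means = [statistics.mean(g) for g in groups]
    sd_between = statistics.pstdev(means)
    sd_within = statistics.mean(statistics.pstdev(g) for g in groups)
    return sd_between / sd_within

def percentile_99(values):
    """99th percentile by linear interpolation between order statistics."""
    s = sorted(values)
    rank = 0.99 * (len(s) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (rank - lo) * (s[hi] - s[lo])

# Toy hs-cTnT-like values (ng/L) for two age groups -- invented data.
groups = [[10, 12, 14], [20, 22, 24]]
ratio = sdr(groups)   # well above 0.3 here, so partitioning is warranted
url = percentile_99(list(range(1, 101)))
```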


Subject(s)
Data Mining , Troponin T , Humans , Troponin T/blood , Female , Male , Aged , Aged, 80 and over , Middle Aged , Reference Values , Sex Factors , Data Mining/methods , China , Age Factors , Asian People , East Asian People
18.
Methods Mol Biol ; 2787: 3-38, 2024.
Article in English | MEDLINE | ID: mdl-38656479

ABSTRACT

In this chapter, we explore the application of high-throughput crop phenotyping facilities for phenotype data acquisition and the extraction of significant information from the collected data through image processing and data mining methods. Additionally, the construction and outlook of crop phenotype databases are introduced and the need for global cooperation and data sharing is emphasized. High-throughput crop phenotyping significantly improves accuracy and efficiency compared to traditional measurements, making significant contributions to overcoming bottlenecks in the phenotyping field and advancing crop genetics.


Subject(s)
Crops, Agricultural , Data Mining , Image Processing, Computer-Assisted , Phenotype , Crops, Agricultural/genetics , Crops, Agricultural/growth & development , Data Mining/methods , Image Processing, Computer-Assisted/methods , Data Management/methods , High-Throughput Screening Assays/methods
19.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38597890

ABSTRACT

MOTIVATION: The rapid increase of biomedical literature makes it harder and harder for scientists to keep pace with the discoveries on which they build their studies. Therefore, computational tools have become more widespread, among which network analysis plays a crucial role in several life-science contexts. Nevertheless, building correct and complete networks about some user-defined biomedical topics on top of the available literature is still challenging. RESULTS: We introduce NetMe 2.0, a web-based platform that automatically extracts relevant biomedical entities and their relations from a set of input texts (i.e., full texts or abstracts of PubMed Central papers, free texts, or PDFs uploaded by users) and models them as a BioMedical Knowledge Graph (BKG). NetMe 2.0 also implements an innovative Retrieval Augmented Generation module (Graph-RAG) that works on top of the relationships modeled by the BKG and distils well-formed sentences that explain their content. The experimental results show that NetMe 2.0 can infer comprehensive and reliable biological networks with significant precision-recall metrics when compared to state-of-the-art approaches. AVAILABILITY AND IMPLEMENTATION: https://netme.click/.
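At its core, a BKG of this kind stores extracted (subject, relation, object) triples as an adjacency structure. The triples below are invented examples for illustration, not NetMe 2.0 output:

```python
# Minimal knowledge-graph sketch: relations extracted from text become
# labelled edges from subject to object.
from collections import defaultdict

def build_graph(triples):
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph

triples = [
    ("TP53", "regulates", "apoptosis"),
    ("TP53", "associated_with", "li-fraumeni syndrome"),
]
graph = build_graph(triples)
```

A Graph-RAG layer would then retrieve such edges and verbalize them into well-formed explanatory sentences.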


Subject(s)
Internet , Software , Data Mining/methods , Computational Biology/methods , PubMed
20.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38608194

ABSTRACT

MOTIVATION: Dysregulation of a gene's function, either due to mutations or impairments in regulatory networks, often triggers pathological states in the affected tissue. Comprehensive mapping of these apparent gene-pathology relationships is an ever-daunting task, primarily due to genetic pleiotropy and a lack of suitable computational approaches. With the advent of high-throughput genomics platforms and community-scale initiatives such as the Human Cell Landscape project, researchers have been able to create gene expression portraits of healthy tissues resolved at the level of single cells. However, a similar wealth of knowledge is currently not at our fingertips when it comes to diseases. This is because the genetic manifestation of a disease is often quite diverse and is confounded by several clinical and demographic covariates. RESULTS: To circumvent this, we mined ∼18 million PubMed abstracts published up to May 2019 and automatically selected ∼4.5 million of them that describe the roles of particular genes in disease pathogenesis. Further, we fine-tuned the pretrained bidirectional encoder representations from transformers (BERT) language model from the domain of natural language processing to learn vector representations of entities such as genes, diseases, tissues, and cell types in a way that preserves their relationships in a vector space. The repurposed BERT predicted disease-gene associations that are not cited in the training data, thereby highlighting the feasibility of in silico synthesis of hypotheses linking different biological entities such as genes and conditions. AVAILABILITY AND IMPLEMENTATION: PathoBERT pretrained model: https://github.com/Priyadarshini-Rai/Pathomap-Model. BioSentVec-based abstract classification model: https://github.com/Priyadarshini-Rai/Pathomap-Model. Pathomap R package: https://github.com/Priyadarshini-Rai/Pathomap.
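Once entities share one vector space, associations can be read off as vector similarity: a higher cosine similarity suggests a stronger literature-derived link. The three-dimensional vectors below are made up for illustration; the study derives its embeddings from fine-tuned BERT:

```python
# Scoring candidate disease-gene associations by cosine similarity of
# learned entity embeddings (vectors here are invented toy values).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

embeddings = {
    "SNCA": [0.90, 0.10, 0.20],               # hypothetical gene vector
    "parkinson_disease": [0.85, 0.15, 0.25],  # hypothetical disease vector
    "asthma": [0.10, 0.90, 0.30],             # hypothetical disease vector
}
score_pd = cosine(embeddings["SNCA"], embeddings["parkinson_disease"])
score_asthma = cosine(embeddings["SNCA"], embeddings["asthma"])
```

Ranking candidate pairs by such scores is one way to surface associations never stated verbatim in the training corpus.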


Subject(s)
Data Mining , Humans , Data Mining/methods , Computational Biology/methods , Natural Language Processing